i.i.d. data (previous x,y pair does not affect next x, y pair)
known ground truth outputs in the training
Problem:
Cannot adapt if something fails.
Reinforcement Learning:
Data is not i.i.d: previous outputs influence future inputs!
Ground truth answer is not known, only know if we succeeded or failed (but we generally know the reward)
Compared to traditional reinforcement learning,
with DRL (Deep Reinforcement Learning) we can solve complex problems end-to-end!
But there are challenges:
Humans can learn incredibly quickly
Deep RLs are usually slow
probably because humans can reuse past knowledge
Not clear what the reward function should be
Not clear what the role of prediction should be
Fundamental Theorem of RL(Lec 4)
Notations + Markov Property
We are actually trying to learn a distribution over actions a given the observation o, where θ parameterizes the policy π. We add a subscript t to each term to denote the index in the time series, since in RL actions are usually correlated across time.
Sometimes instead of πθ(at∣ot) we will see πθ(at∣st).
Difference - can use observation to infer state, but sometimes (in this example, car in front of the cheetah) the observation does not give enough information to let us infer state.
State is the true configuration of the system and observation is something that results from the state which may or may not be enough to deduce the state
🔥
The state is usually assumed to be a Markovian state
Markov property is very important property ⇒ If you want to change the future you just need to focus on current state and your current action
However, is your future state independent of your past observation? In general, observations do not satisfy Markov property
Many RL algorithms in this course require the Markov property
Side note on notation
RL world (Richard Bellman)
Robotics (Lev Pontryagin)
st state
xt state
at action
ut action
r(s,a) reward
c(x,u) cost =−r(x,u)
In general, we want to minimize:
$$\min_\theta \; \mathbb{E}_{s_{1:T},\,a_{1:T}}\left[\sum_t c(s_t, a_t)\right]$$
We can write the function as either c or r, standing for (minimizing) cost or (maximizing) reward
🔥
Notice that we want to maximize the reward (or minimize the cost) under the expectation of the states that we are going to encounter in production / test environment, not training states.
Definition of Markov Chain
M={S,T}
$\mathcal{S}$ → state space, $s \in \mathcal{S}$ → state (discrete or continuous); $\mathcal{T}$ → transition operator, described by $p(s_{t+1} \mid s_t)$
The transition “operator”?
Let $\mu_{t,i} = p(s_t = i)$; then $\mu_t$ is a vector of probabilities. Let $\mathcal{T}_{i,j} = p(s_{t+1} = i \mid s_t = j)$; then $\mu_{t+1} = \mathcal{T}\mu_t$.
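The update $\mu_{t+1} = \mathcal{T}\mu_t$ can be checked on a made-up 2-state chain (the numbers are purely illustrative; columns of $T$ sum to 1):

```python
import numpy as np

# Toy 2-state Markov chain. T[i, j] = P(s_{t+1} = i | s_t = j),
# so each COLUMN of T sums to 1 and mu_{t+1} = T @ mu_t.
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])

mu = np.array([1.0, 0.0])  # start deterministically in state 0
for _ in range(3):
    mu = T @ mu            # one application of the transition operator

print(mu)  # still a valid probability vector: entries sum to 1
```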
Definition of Markov Decision Process
M={S,A,T,r}
$\mathcal{S}$ → state space, $s \in \mathcal{S}$ → state (discrete or continuous); $\mathcal{A}$ → action space, $a \in \mathcal{A}$ → action (discrete or continuous); $r: \mathcal{S}\times\mathcal{A} \to \mathbb{R}$ → reward function
$\mathcal{T}$ → transition operator (now a tensor): with $\mu_{t,i} = p(s_t = i)$ and $\xi_{t,k} = p(a_t = k)$, let $\mathcal{T}_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k)$, so $\mu_{t+1,i} = \sum_{j,k} \mathcal{T}_{i,j,k}\,\mu_{t,j}\,\xi_{t,k}$
Partially Observed Markov Decision Process
$\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \xi, r\}$; $\mathcal{S}$ → state space, $s \in \mathcal{S}$ → state (discrete or continuous); $\mathcal{A}$ → action space, $a \in \mathcal{A}$ → action (discrete or continuous); $\mathcal{O}$ → observation space, $o \in \mathcal{O}$ → observation (discrete or continuous); $\mathcal{T}$ → transition operator; $\xi$ → emission probability $p(o_t \mid s_t)$; $r: \mathcal{S}\times\mathcal{A} \to \mathbb{R}$ → reward function
Goal of RL (Finite Horizon / Episodic)
$p_\theta(\tau)$, where $\tau$ denotes a trajectory (a sequence of states and actions): $p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1)\prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\,p(s_{t+1}\mid s_t, a_t)$ — the product term is a Markov chain on $(s, a)$
So our goal is to maximize the expectation of rewards
$$\theta^* = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$
Group action and state together to form augmented states. Augmented states form a Markov chain.
Use average reward: $\theta^* = \arg\max_\theta \frac{1}{T}\mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]$
Use discounts
Q: Does p(st,at) converge to a stationary distribution as t→∞?
Ergodic Property: Every state can transition into another state with a non-zero probability.
“Stationary”: The same before and after transition
Stationary means the distribution is the same before and after a transition: $\mu = \mathcal{T}\mu$, i.e.
$(\mathcal{T} - I)\mu = 0$
This means that μ is an eigenvector of T with eigenvalue of 1 (always exists under some regularity conditions)
μ=pθ(s,a)→stationary distribution
As t goes further and further toward infinity, the reward terms become dominated by the stationary distribution (almost infinitely many terms under the stationary $\mu$ versus finitely many non-stationary state/action pairs).
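The stationary distribution $\mu = \mathcal{T}\mu$ is exactly the eigenvector of $\mathcal{T}$ with eigenvalue 1, which we can extract numerically for the same kind of toy chain (made-up numbers):

```python
import numpy as np

# Stationary distribution of the chain: solve mu = T mu, i.e. find the
# eigenvector of T with eigenvalue 1 and normalize it into a probability vector.
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])

vals, vecs = np.linalg.eig(T)
i = np.argmin(np.abs(vals - 1.0))   # index of the eigenvalue closest to 1
mu = np.real(vecs[:, i])
mu = mu / mu.sum()                  # fix sign/scale so entries sum to 1

# Check the fixed-point property: applying T leaves mu unchanged.
assert np.allclose(T @ mu, mu)
```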
RL is about maximizing the expectation of rewards, and expectations can be continuous in the parameters of the corresponding distributions even when the function we take the expectation of is itself highly discontinuous. This property lets us use smooth optimization methods like gradient descent.
📌
Because the distribution of the probability of getting a reward usually has a continuous PDF.
High-level anatomy of reinforcement learning algorithms
Which parts are expensive?
Value Function & Q Function
Remember that RL problems are framed as argmax of expectation of rewards, but how do we actually deal with the expectations?
📌
Write it out recursively!
We can write this expectation out explicitly as a nested expectation (looks like a DP algorithm!)
It looks messy, but what if we had a function that tells us the value inside $\mathbb{E}_{a_1\sim\pi(a_1\mid s_1)}$?
Specifically, it's the value of
$$r(s_1,a_1) + \mathbb{E}_{s_2\sim p(s_2\mid s_1,a_1)}\Big[\mathbb{E}_{a_2\sim\pi(a_2\mid s_2)}\big[r(s_2,a_2) + \cdots \mid s_2\big] \,\Big|\, s_1,a_1\Big]$$
We will define it as our Q function
It’s a lot easier to modify the parameter θ of πθ(a1∣s1) if Q(s1,a1) is known
Q-Function Definition
$$Q^\pi(s_t,a_t) = \sum_{t'=t}^{T}\mathbb{E}_{\pi_\theta}\big[r(s_{t'},a_{t'}) \mid s_t, a_t\big] \;\to\; \text{total reward from taking } a_t \text{ in } s_t$$
Value Function Definition
$$V^\pi(s_t) = \sum_{t'=t}^{T}\mathbb{E}_{\pi_\theta}\big[r(s_{t'},a_{t'}) \mid s_t\big] \;\to\; \text{total reward from } s_t$$
Vπ(st)=Eat∼π(at∣st)[Qπ(st,at)]
Now Es1∼p(s1)[Vπ(s1)] is the RL objective.
To use those two functions:
If we have policy π, and we know Qπ(s,a), then we can improve π:
set $\pi'(a\mid s) = 1$ if $a = \arg\max_a Q^\pi(s,a)$
(This policy is at least as good as π and probably better)
OR
Compute gradient to increase the probability of good actions a
if Qπ(s,a)>Vπ(s), then a is better than average. Recall that Vπ(s)=E[Qπ(s,a)] under π(a∣s).
Modify π(a∣s) to increase the probability of a if Qπ(s,a)>Vπ(s)
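The first option above — the greedy deterministic policy from a known Q — can be sketched with a tabular Q-function (the numbers are made up for illustration):

```python
import numpy as np

# Greedy policy improvement from a tabular Q-function:
# pi'(a|s) = 1 for a = argmax_a Q(s, a), 0 otherwise.
Q = np.array([[1.0, 2.0],    # Q[s, a] for 3 states, 2 actions (toy values)
              [0.5, 0.1],
              [3.0, 3.5]])

greedy_actions = Q.argmax(axis=1)             # best action per state
pi_new = np.eye(Q.shape[1])[greedy_actions]   # one-hot deterministic policy

print(greedy_actions)  # -> [1 0 1]
```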
Where $F_t = \nabla_{x_t,u_t} f(\hat{x}_t,\hat{u}_t)$, $C_t = \nabla^2_{x_t,u_t} c(\hat{x}_t,\hat{u}_t)$, $c_t = \nabla_{x_t,u_t} c(\hat{x}_t,\hat{u}_t)$
Now we can use LQR with dynamics fˉ, cost cˉ to find δxt and δut
iLQR is basically an approximation of Newton's method for solving the control objective.
But we used a first-order dynamics approximation here; to get full Newton's method, we need a second-order dynamics approximation (DDP)
🔥
But this may be a bad idea because we are always going to the approximated minima and the approximation is only usable within a short range
We can just add a constant α term to the kt in the forward pass ⇒ this lets us control how far we move from our starting point (the smaller α is, the closer we stay to the original starting point)
Original differential dynamic programming algorithm.
Tassa, Erez, Todorov. (2012). Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization.
Practical guide for implementing non-linear iterative LQR.
Levine, Abbeel. (2014). Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics.
Probabilistic formulation and trust region alternative to deterministic line search.
Types of RL Algorithms
Remember the objective of RL:
θ∗=θargmaxEτ∼pθ(τ)[t∑r(st,at)]
Policy Gradient
Directly differentiate the above objective
Value-based
Estimate value function or Q-function of the optimal policy (no explicit policy)
Then use those functions to improve the policy
Actor-critic (A mix between policy gradient and value-based)
Estimate value function or Q-function of the current policy, use it to improve policy
Model-based RL
Estimate the transition model, and then
Use it for planning (no explicit policy)
Use it to improve a policy
Other variants
Supervised Learning of Behaviors(LEC 2)
Imitation Learning
⚠️
Does it work naively? NO!!!! Small mistakes compound and the policy drifts from the training distribution.
In practice, with enough data and long training it can actually work.
“tightrope walker scenario”: has to follow the exact same behavior, otherwise the model does not know what to do
Three angles supervised to steer differently! This trick helped a lot
In effect, the training data now contains scenarios that teach the model how to correct its own small mistakes.
Instead of being clever about pπθ(ot), let’s be clever about pdata(ot) ⇒ Do DAgger(Data Aggregation)
DAgger (Data Aggregation)
🔥
Addresses “Distributional Drift”, adds on-policy data to the model
Idea: To collect data from pπθ(ot) instead of pdata(ot)
How: Just run πθ(ot) but we need labels at
train πθ(at∣ot) on human dataset D={o1,a1,…,oN,aN}
run πθ(at∣ot) to obtain Dπ={o1,…,oM}
Ask humans to label Dπ with actions at
Aggregate D←D∪Dπ
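The four DAgger steps above can be sketched as a loop; `train`, `run_policy`, and `expert_label` are hypothetical stand-ins for the learner, the rollout procedure, and the human labeler:

```python
# A minimal sketch of the DAgger loop (the three callables are hypothetical
# stand-ins, not real library functions).
def dagger(initial_dataset, train, run_policy, expert_label, iters=10):
    D = list(initial_dataset)      # D = {(o_1, a_1), ..., (o_N, a_N)}
    policy = train(D)              # 1. train pi_theta on human data
    for _ in range(iters):
        observations = run_policy(policy)                    # 2. run pi_theta
        D_pi = [(o, expert_label(o)) for o in observations]  # 3. human labels
        D = D + D_pi                                         # 4. aggregate
        policy = train(D)                                    # retrain, repeat
    return policy
```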
But:
Lots of the problem is in step 3
We as humans act from closed-loop feedback; it is unnatural to look at an isolated observation and label the optimal action offline
What if the model does NOT drift?
Need to mimic expert behavior very accurately
Why might we fail the expert?
Figure 1Figure 2
Non-Markovian Behavior
Unnatural for humans to do perfect Markovian actions: Humans more like πθ(at∣o1,…,ot) ⇒ based on previous observations
Solve this by adding RNN / LSTM cells to the network (Fig. 1)
May still work poorly (Figure 2)
Multimodal Behavior (Multiple Modes/Peaks in real distribution)
If discrete actions, then this is not an issue
But continuous actions would require exponentially many discrete bins (for a softmax)
Solved by
Output mixture of Gaussians
π(a∣o)=∑iwiN(μi,Σi)
Tradeoffs:
need more output parameters
The ability to model with this in high dimensions is challenging (in theory the number of mixture elements rises exponentially with dimension increase)
Latent variable models
The output is still Gaussian
In addition to feeding an image into the NN, we also feed a latent variable (possibly drawn from a prior distribution) into the model.
Conditional Variational Autoencoders (VAEs)
Normalizing Flow/realNVP
Stein Variational Gradient Descent
Complex to train
Autoregressive discretization
discretizes one dimension at a time using an NN trick ⇒ never incur the exponential cost
Adds a softmax over one dimension, samples discretely (obtaining a value for dim 1), feeds that value into another NN and softmax layer, samples again to obtain dim 2, and so on.
Have to pick bins
Autoregressive discretization
Questions:
Does including history info (LSTM/RNN) mitigate causal confusion?
My guess is no since history has no correlation with confusion
Can DAgger mitigate causal confusion?
My guess is yes, since the states where the model gets confused will end up in the aggregated dataset and then be manually labeled.
What's wrong with imitation learning:
Humans need to provide data, which is limited
DL works best when data is plentiful
Humans are not good at providing some actions
Humans can learn autonomously, can we do the same on machines?
Unlimited data from own experience
continuous self-improvement
Analysis of Imitation Learning
Ross et al. “Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”
We assume:
$$c(s,a) = \begin{cases}0 & \text{if } a = \pi^*(s)\\ 1 & \text{otherwise}\end{cases}$$
and $\pi_\theta(a \neq \pi^*(s)\mid s) \le \epsilon$ for $s \sim p_{\text{train}}(s)$ ← the model is doing well on the training set (actually it is enough that $\mathbb{E}_{p_{\text{train}}(s)}\big[\pi_\theta(a \neq \pi^*(s)\mid s)\big] \le \epsilon$)
then…
with DAgger
$p_{\text{train}}(s) \to p_\theta(s)$, so $\mathbb{E}\left[\sum_t c(s_t,a_t)\right] \le \epsilon T$
Without DAgger
if $p_{\text{train}}(s) \neq p_\theta(s)$: $\;p_\theta(s_t) = \underbrace{(1-\epsilon)^t}_{\text{prob. of no mistakes}} p_{\text{train}}(s_t) + \left(1-(1-\epsilon)^t\right) p_{\text{mistake}}(s_t)$
We cannot assume anything for pmistake
But we can bound the total variation divergence of training and testing
$\left|p_\theta(s_t) - p_{\text{train}}(s_t)\right| = \left(1-(1-\epsilon)^t\right)\left|p_{\text{mistake}}(s_t) - p_{\text{train}}(s_t)\right| \le 2\left(1-(1-\epsilon)^t\right)$ (substituting the expression above in for $p_\theta(s_t)$)
Another way to imitate (Goal Conditional Behavioral Cloning)
After clarification from class:
Sometimes we have bad “demonstrations” that lead to different end states
Those demonstrations may be close enough to each other that we can train a shared policy covering those different end states (most of the earlier states can be handled by shared weights)
During production / test, specify the end state you want to achieve and then we can predict πθ(a∣s,g)
“Learning to Reach Goals via Iterated Supervised Learning” - Dibya Ghosh, Abhishek Gupta, et al.
Model-Based RL
e.g.
Dyna
Guided Policy Search
Generate samples(run the policy) ⇒
Fit a model of p(st+1∣st,at) ⇒
Then improve the policy(a few options)
Improving policy:
Just use the model to plan (no policy)
Trajectory optimization / optimal control (primarily in continuous spaces)
Backprop to optimize over actions
Discrete planning in discrete action spaces
Monte Carlo tree search
Backprop gradients into the policy
Requires some tricks to make it work
Use the model to learn a separate value function or Q function
Dynamic Programming
Generate simulated experience for model-free learner
Value function based algorithms
e.g.
Q-Learning, DQN
Temporal Difference Learning
Fitted Value Iteration
Generate samples(run the policy) ⇒
Fit a model of V(s) or Q(s,a) ⇒
Then improve the policy (set $\pi(s) = \arg\max_a Q(s,a)$)
Direct Policy Gradients
e.g.
REINFORCE
Natural Policy Gradient
Trust Region Policy Optimization
Generate samples (run the policy) ⇒
Estimate the return
$$R_\tau = \sum_t r(s_t, a_t)$$
Improve the policy by
$$\theta \leftarrow \theta + \alpha \nabla_\theta \mathbb{E}\left[\sum_t r(s_t,a_t)\right]$$
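A minimal REINFORCE sketch on a made-up 2-armed bandit (not from the lecture): a softmax policy over two arms, with score function $\nabla_\theta \log\pi_\theta(a) = \mathrm{onehot}(a) - \pi$:

```python
import numpy as np

# REINFORCE on a toy 2-armed bandit with a softmax policy
# pi_theta(a) = softmax(theta)[a]; the "trajectory" is a single action.
rng = np.random.default_rng(0)
theta = np.zeros(2)
arm_rewards = np.array([0.0, 1.0])   # arm 1 is strictly better
alpha = 0.1

for _ in range(2000):
    pi = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=pi)                        # sample an action
    grad_log_pi = np.eye(2)[a] - pi                # grad_theta log pi_theta(a)
    theta += alpha * grad_log_pi * arm_rewards[a]  # theta <- theta + alpha*r*ghat

pi = np.exp(theta) / np.exp(theta).sum()
# the policy should now strongly prefer the better arm
```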
Actor-critic
e.g.
Asynchronous Advantage Actor-Critic (A3C)
Soft actor-critic (SAC)
Generate samples ⇒
Fit a model V(s) or Q(s,a) ⇒
Improve the policy
$$\theta \leftarrow \theta + \alpha \nabla_\theta \mathbb{E}\left[\sum_t r(s_t,a_t)\right]$$
Tradeoff between algorithms
Sample efficiency
“How many samples for a good policy?”
Most important question: Is the algorithm off-policy?
Off policy means being able to improve the policy without generating new samples from that policy
On policy means we need to generate new samples even if the policy is changed a little bit
⚠️
We may still want to use a less efficient algorithm because of the other tradeoffs and maybe the price of generating new samples is really cheap.
Stability & Ease of use
Does our policy converge, and if it does, to what?
🔥
Supervised Learning: almost always gradient descent, which is very likely to converge
Reinforcement Learning: Often not Gradient Descent
Value function fitting:
Fixed point iteration
At best, minimizes error of fit (“Bellman error”)
Not the same as expected reward
At worst, doesn’t optimize anything
Many popular DRL value fitting algorithms are not guaranteed to converge to anything in the nonlinear case
Model-based RL
Model is not optimized for expected reward
Model minimizes error of fit
This will converge
No guarantee that better model = better policy
Policy Gradient
The only one that performs Gradient Descent / Ascent on the true objective
Different assumptions
Stochastic or deterministic environments?
Continuous or discrete (states and action)?
Episodic(finite T) or infinite horizon?
Different things are easy or hard in different settings
Easier to represent the policy?
Easier to represent the model?
Common Assumptions:
Full observability
Generally assumed by value function fitting methods
Can be mitigated by adding recurrence
Episodic Learning
Often assumed by pure policy gradient methods
Assumed by some model-based RL methods
Other methods don't assume it, but tend to work better when it holds
Continuity or smoothness
Assumed by some continuous value function learning methods
Markov property is not actually used so we can just use oi,t in place of si,t
What’s wrong?
Blue graph corresponds to PDF of reward, green bars represent samples of rewards, and dashed blue graph corresponds to fitted policy distribution.
What if the reward is shifted up by a constant (while everything else stays the same)?
We see huge differences in fitted policy distributions ⇒ we have a high variance problem
How we can modify policy gradient to reduce variance?
🔥
Causality: policy at time t’ cannot affect reward at time t if t<t’
We want to show that the expectation of the previous-reward terms converges to 0 so we can rewrite the summation. Note that removing those terms changes the estimator in finite samples, but it remains unbiased.
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})\underbrace{\left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)}_{\text{reward to go } \hat{Q}_{i,t}}$$
We have removed some items from the sum therefore the variance is lower.
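The reward-to-go $\hat{Q}_{i,t}$ can be computed for a sampled trajectory in $O(T)$ with a reverse cumulative sum (toy rewards for illustration):

```python
import numpy as np

# "Reward to go": Q_hat_t = sum_{t'=t}^{T} r_{t'} for one trajectory,
# computed in O(T) via a reverse cumulative sum.
rewards = np.array([1.0, 0.0, 2.0, 1.0])   # r(s_t, a_t) along a trajectory

reward_to_go = np.cumsum(rewards[::-1])[::-1]
print(reward_to_go)   # -> [4. 3. 3. 1.]
```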
The power on γ being t'−t is important: the gradient is NOT decaying as t advances. We put less weight on future rewards within the current step, but in future steps those rewards are valued fully again.
If we instead used power t'−1 (or t−1), then as we train from t=0 onward, the gradient would shrink as t increases.
Baselines
We want policy gradients to reinforce actions whose rewards are above average and penalize actions whose rewards are below average.
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \nabla_\theta\log p_\theta(\tau_i)\,\big[r(\tau_i) - b\big]$$
Without a baseline, b = 0; with a baseline, set b to the average return:
$$b = \frac{1}{N}\sum_{i=1}^N r(\tau_i)$$
Changing b keeps the estimator unbiased but changes the variance!
Intuition: Different baseline for every entry in the gradient, the baseline for every parameter value is the expected value of the reward weighted by the magnitude of the gradient for the parameter value
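A toy numerical illustration (not a real policy gradient): using a zero-mean stand-in for $\nabla_\theta\log p_\theta(\tau)$ and large positive returns, subtracting the average-return baseline leaves the mean of the estimator roughly unchanged while shrinking its variance:

```python
import numpy as np

# Empirical check that the average-return baseline keeps the estimate
# (roughly) unbiased while reducing its variance. The "gradient" g here is a
# made-up +/-1 toy model of grad log p(tau), independent of the return.
rng = np.random.default_rng(0)

g = rng.choice([-1.0, 1.0], size=100_000)   # stand-in for grad log p(tau)
r = 10.0 + rng.normal(size=g.size)          # large positive noisy returns

b = r.mean()
raw = g * r              # estimator without baseline
baselined = g * (r - b)  # estimator with average-return baseline

assert abs(raw.mean() - baselined.mean()) < 0.1   # same expectation (~0)
assert baselined.var() < raw.var()                # but much lower variance
```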
Off-policy setting
Policy gradient is defined to be on-policy
The expectation under $p_\theta$ forces us to collect new samples every time we update the policy, but neural nets only change a little with each gradient step.
So we can modify the policy a bit
Instead of having samples from pθ(τ), we have samples from pˉ(τ)
We can use importance sampling to accommodate this case
$\prod_{t'=1}^{t}\frac{\pi_{\theta'}(a_{t'}\mid s_{t'})}{\pi_\theta(a_{t'}\mid s_{t'})}$ ⇒ future actions don't affect the current weight ⇒ however, the term is still there, and it's trouble because it's exponential in T
$\prod_{t''=t}^{t'}\frac{\pi_{\theta'}(a_{t''}\mid s_{t''})}{\pi_\theta(a_{t''}\mid s_{t''})}$ ⇒ we cannot completely ignore this: stripping this term out would no longer be gradient descent, but rather the “policy iteration” algorithm
Since the first term is exponential, we can try to write the objective a bit differently
$$\nabla_{\theta'} J(\theta') \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \frac{\pi_{\theta'}(s_{i,t}, a_{i,t})}{\pi_\theta(s_{i,t}, a_{i,t})}\,\nabla_{\theta'}\log\pi_{\theta'}(a_{i,t}\mid s_{i,t})\,\hat{Q}_{i,t} = \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \underbrace{\frac{\pi_{\theta'}(s_{i,t})}{\pi_\theta(s_{i,t})}}_{\text{can we ignore this part?}}\frac{\pi_{\theta'}(a_{i,t}\mid s_{i,t})}{\pi_\theta(a_{i,t}\mid s_{i,t})}\,\nabla_{\theta'}\log\pi_{\theta'}(a_{i,t}\mid s_{i,t})\,\hat{Q}_{i,t}$$
This does not necessarily give the optimal policy, but its a reasonable approximation.
Policy Gradient with Automatic Differentiation
We don’t want to be inefficient at computing gradients ⇒ use backprop trick
So why don’t we start from the MLE approach to Policy Gradients?
$$\nabla_\theta J_{ML} \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})$$
To start doing backprop, let’s define the objective function first.
$$J_{ML}(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \log\pi_\theta(a_{i,t}\mid s_{i,t})$$
So we will define our “pseudo-loss” approximation as weighted maximum likelihood
$$\tilde{J}(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \log\pi_\theta(a_{i,t}\mid s_{i,t})\,\hat{Q}_{i,t}$$
A few reminders about tuning hyper-parameters
Gradients have very high variance
Isn’t the same as supervised learning
Consider using much larger batches
Tweaking learning rates is VERY hard
Adaptive step size rules like ADAM can be OK-ish
Covariant/natural policy gradient
🤔
One nice thing about one-dimensional state and action spaces is that you can easily visualize the policy distribution on a 2D plane
θ←θ+α∇θJ(θ)
We can view the first order gradient descent as follows
$$\theta' \leftarrow \arg\max_{\theta'} (\theta'-\theta)^\top \nabla_\theta J(\theta) \quad \text{s.t. } \|\theta'-\theta\|^2 \le \epsilon$$
Think of ϵ as defining a high-dimensional sphere.
We can instead put constraint on the distribution of the policy
$$\theta' \leftarrow \arg\max_{\theta'} (\theta'-\theta)^\top \nabla_\theta J(\theta) \quad \text{s.t. } D(\pi_{\theta'}, \pi_\theta) \le \epsilon$$
Where $D(\cdot,\cdot)$ is a divergence measure between distributions.
So we want to choose some parameterization-independent divergence measure
Usually KL-divergence
$$D_{KL}(\pi_{\theta'}\,\|\,\pi_\theta) = \mathbb{E}_{\pi_{\theta'}}\big[\log\pi_{\theta'} - \log\pi_\theta\big]$$
A bit hard to plug into our gradient descent ⇒ we will approximate this divergence using a 2nd-order Taylor expansion around the current parameter value θ
Advanced Policy Gradients
Why does policy gradient work?
Estimate A^π(st,at) (Monte-carlo or function approximator) for current policy π
Use A^π(st,at) to get improved policy π′
We are going back and forth between (1) and (2)
Looks familiar to policy iteration algorithm!
Nice thing about policy gradients:
Even with a perfect advantage estimator, we only move closer to the estimator's “optimal” policy rather than jumping straight to that optimum
But how do we formalize this?
If we set the loss function of policy gradient as
$$J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_t \gamma^t r(s_t,a_t)\right]$$
Then we can calculate how much we improved by
$$J(\theta') - J(\theta) = \mathbb{E}_{\tau\sim p_{\theta'}(\tau)}\left[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)\right]$$
Here we claimed that J(θ')−J(θ) equals the expectation of the discounted advantages of the old policy, taken over trajectories that the new policy would explore.
We will prove by:
$$\begin{aligned} J(\theta') - J(\theta) &= J(\theta') - \mathbb{E}_{s_0\sim p(s_0)}\big[V^{\pi_\theta}(s_0)\big] \\ &= J(\theta') - \mathbb{E}_{\tau\sim p_{\theta'}(\tau)}\big[V^{\pi_\theta}(s_0)\big] \quad \text{(same initial state distribution)} \\ &= J(\theta') - \mathbb{E}_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t V^{\pi_\theta}(s_t) - \sum_{t=1}^\infty \gamma^t V^{\pi_\theta}(s_t)\Big] \\ &= J(\theta') + \mathbb{E}_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t\big(\gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\big)\Big] \\ &= \mathbb{E}_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\Big] + \mathbb{E}_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t\big(\gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\big)\Big] \\ &= \mathbb{E}_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t\big(r(s_t,a_t) + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\big)\Big] \\ &= \mathbb{E}_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t A^{\pi_\theta}(s_t,a_t)\Big] \end{aligned}$$
🔥
This proof shows that by optimizing Eτ∼pθ’[∑tγtAπθ(st,at)], we are actually improving our policy
But how does this relate to our policy gradient?
$$\begin{aligned} \mathbb{E}_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)\Big] &= \sum_t \mathbb{E}_{s_t\sim p_{\theta'}(s_t)}\Big[\mathbb{E}_{a_t\sim\pi_{\theta'}(a_t\mid s_t)}\big[\gamma^t A^{\pi_\theta}(s_t,a_t)\big]\Big] \\ &= \sum_t \mathbb{E}_{s_t\sim p_{\theta'}(s_t)}\Big[\mathbb{E}_{a_t\sim\pi_\theta(a_t\mid s_t)}\Big[\tfrac{\pi_{\theta'}(a_t\mid s_t)}{\pi_\theta(a_t\mid s_t)}\gamma^t A^{\pi_\theta}(s_t,a_t)\Big]\Big] \quad \text{(importance sampling)} \\ &\approx \sum_t \mathbb{E}_{s_t\sim p_\theta(s_t)}\Big[\mathbb{E}_{a_t\sim\pi_\theta(a_t\mid s_t)}\Big[\tfrac{\pi_{\theta'}(a_t\mid s_t)}{\pi_\theta(a_t\mid s_t)}\gamma^t A^{\pi_\theta}(s_t,a_t)\Big]\Big] \quad \text{(ignoring distribution mismatch)} \\ &= \bar{A}(\theta') \end{aligned}$$
We want this approximation to be true so that
J(θ′)−J(θ)≈Aˉ(θ′)
And then we can use
$$\theta' \leftarrow \arg\max_{\theta'} \bar{A}(\theta')$$
to optimize our policy
And use A^π(st,at) to get improved policy π’
But when is this true?
Bounding the distribution change (deterministic)
We will show:
🔥
pθ(st) is close to pθ’(st) when πθ is close to πθ’
We will assume that πθ is a deterministic policy at=πθ(st)
We define closeness of policy as:
$\pi_{\theta'}$ is close to $\pi_\theta$ if $\pi_{\theta'}(a_t \neq \pi_\theta(s_t)\mid s_t) \le \epsilon$
$$p_{\theta'}(s_t) = \underbrace{(1-\epsilon)^t}_{\text{prob. we made no mistakes}} p_\theta(s_t) + \big(1-(1-\epsilon)^t\big)\underbrace{p_{\text{mistake}}(s_t)}_{\text{some other distribution}}$$
This implies the total variation divergence is bounded: $\left|p_{\theta'}(s_t) - p_\theta(s_t)\right| \le 2\big(1-(1-\epsilon)^t\big) \le 2\epsilon t$
Where C is a constant: the largest value the quantity inside the state expectation can take ⇒ a probability times an advantage ⇒ the largest possible reward times the number of time steps.
So:
$C \in O(T\, r_{\max})$, or for infinite horizon with discount (geometric series), $O\!\left(\frac{r_{\max}}{1-\gamma}\right)$
Therefore for small enough $\epsilon$, $\bar{A}^\pi(\theta')$ is guaranteed to be similar to the true RL objective
Remember in the original policy gradient algorithm, θ←θ+α∇θJ(θ)
But some parameters change probabilities a lot more than others, we are not considering the KL Divergence in this case
⚠️
so we either want to reparameterize the policy so the parameters have equal effects, or change the learning rate per parameter
But it seems like the problem can be fixed, notice that
Gradient Descent still poses some kind of constraint ∣∣θ−θ’∣∣2≤ϵ
Think of parameter space: we are constrained to within some fixed distance (inside a circle) of the original parameters. ⇒ Indeed the learning rate $\alpha = \sqrt{\frac{\epsilon}{\|\nabla_\theta J(\theta)\|^2}}$ satisfies this constraint.
But in KL divergence, maybe we want an ellipse? (we want a circle in distribution space)
We will use second order taylor expansion to approximate the KL divergence to make it easier to calculate.
Now we see that the shape of the new ellipse in parameter space is determined by the matrix F
We can estimate this F expectation using samples
So therefore!
Using the approximation, our policy becomes
$$\theta \leftarrow \arg\max_{\theta'} (\theta'-\theta)^\top \nabla_\theta J(\theta) \quad \text{s.t. } \|\theta'-\theta\|_F^2 \le \epsilon$$
With this we have the “natural gradient”
θ←θ+αF−1∇θJ(θ)
If we want to enforce the constraint, we can choose a step size that enforces the constraint.
We can calculate step size α by
$$\alpha = \sqrt{\frac{2\epsilon}{\nabla_\theta J(\theta)^\top F^{-1}\,\nabla_\theta J(\theta)}}$$
Or run the conjugate gradient method to compute $F^{-1}\nabla_\theta J(\theta)$ and get α as a by-product (see the trust region policy optimization paper)
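A tiny numpy sketch of the natural gradient step under the quadratic KL approximation (the Fisher matrix and gradient here are made up): the step size is chosen so that $\frac{1}{2}d^\top F d = \epsilon$ holds exactly for the step $d$:

```python
import numpy as np

# Natural gradient step theta <- theta + alpha * F^{-1} grad, with alpha
# chosen so the quadratic KL constraint (1/2) d^T F d <= eps binds exactly.
F = np.array([[4.0, 0.0],     # toy Fisher matrix: parameter 0 moves the
              [0.0, 0.25]])   # distribution much more than parameter 1
grad = np.array([1.0, 1.0])
eps = 0.1

nat_grad = np.linalg.solve(F, grad)           # F^{-1} grad
alpha = np.sqrt(2 * eps / (grad @ nat_grad))  # sqrt(2 eps / g^T F^{-1} g)
step = alpha * nat_grad

# The step satisfies the (2nd-order) KL constraint with equality:
assert np.isclose(0.5 * step @ F @ step, eps)
```

Note how the natural gradient moves more along parameter 1, whose direction barely changes the distribution, and less along parameter 0.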
Several algorithms utilize this trick:
natural gradient: pick α
trust region policy optimization: pick ϵ
Can solve for optimal α while solving F−1∇θJ(θ)
Conjugate gradient works well for this
Recommended Readings
Natural Policy Gradient
Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust Region Policy Optimization
Peters, Schaal. Reinforcement Learning of motor skills with policy gradients
IS objective directly
Proximal Policy Optimization (PPO)
Is this even a problem in practice?
👉
It’s a pretty big problem especially with continuous action spaces. Generally a good choice to stabilize policy gradient training
With this problem, the vanilla policy gradients do point in the right direction, but they are extremely ill-posed (too busy decreasing σ).
Actor-Critic Method
Remember from the policy gradient that
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\sum_{t=1}^T \nabla_\theta\log\pi_\theta(a_t\mid s_t)\right)\left(\sum_{t=1}^T r(s_t,a_t)\right)\right] \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})\underbrace{\left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)}_{\text{reward to go } \hat{Q}_{i,t}}$$
Q^i,t is the estimate of expected reward if we take action ai,t in state si,t
But Q^i,t currently has a very high variance
Q^i,t is only taking account of a specific chain of action-state pairs
This is because we approximated the gradient by stripping away the expectation
We can get a better estimate by using the true expected reward-to-go
$$Q^\pi(s_t,a_t) = \sum_{t'=t}^{T}\mathbb{E}_{\pi_\theta}\big[r(s_{t'},a_{t'}) \mid s_t, a_t\big]$$
What about baseline? Can we apply a baseline even if we have a true Q function?
Vπ(st)=Eat∼πθ(at∣st)[Qπ(st,at)]
It turns out we can do better (lower variance) than the baseline $b_t = \frac{1}{N}\sum_i Q(s_{i,t},a_{i,t})$, because with the value function the baseline can depend on the state.
So now our gradient becomes
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})\underbrace{\left(Q^\pi(s_{i,t},a_{i,t}) - V^\pi(s_{i,t})\right)}_{\text{how much better } a_{i,t} \text{ is than the average action}} \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})\,A^\pi(s_{i,t},a_{i,t})$$
We will also name something new: the advantage function:
How much better the action ai,t is than the average action
Aπ(st,at)=Qπ(st,at)−Vπ(st)
The better the estimate of $A^\pi$, the lower the variance of $\nabla_\theta J(\theta)$.
Lower variance than Monte Carlo evaluation, but higher bias (because V^ϕπ might be incorrect)
Batch actor-critic algorithm
⚠️
The fitted value function is not guaranteed to converge, for the same reason as in the section “Value Function Learning Theory”
Discount Factor
If T→∞, then $\hat{V}^\pi_\phi$ (the approximator for $V^\pi$) can become infinitely large in many cases, so we regularize the reward toward “sooner rather than later”.
yi,t≈r(si,t,ai,t)+γV^ϕπ(si,t+1)
Where γ∈[0,1] is the discount factor; it makes the reward decay at every timestep ⇒ so the obtainable reward over an infinite lifetime is actually bounded.
One way to understand how γ affects the policy: γ adds a “death state” that, once entered, you can never leave.
Idea: Collect data and instead of training on them directly, put them into a replay buffer. At training instead of using the data just collected, fetch one randomly from the replay buffer.
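A minimal replay-buffer sketch (illustrative, not the lecture's implementation): transitions go in sequentially, and training batches are drawn at random so they are not temporally correlated:

```python
import random
from collections import deque

# Minimal replay buffer: store (s, a, r, s') transitions, sample random
# minibatches so consecutive (correlated) states don't end up in one batch.
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(100):                 # fake transitions for illustration
    buf.add(t, t % 2, 1.0, t + 1)
batch = buf.sample(8)
assert len(batch) == 8
```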
Coming from the online algorithm, let’s see what problems do we need to fix:
(1) Under the current policy, it is possible that we would not even have taken the stored next action $a_i'$, so it's not valid to assume we receive the reward $r(s_i, a_i, s_i')$ ⇒ we may not even arrive at state $s_i'$
(2) Same reason, we may not have taken ai as our action under the current policy
We can fix the problem in (1) by using Qπ(st,at) ⇒ replace the term γV^ϕπ(si’)
Now
$$\mathcal{L}(\phi) = \frac{1}{N}\sum_i \left\|\hat{Q}^\pi_\phi(s_i,a_i) - y_i\right\|^2$$
And we replace the target value
$$y_i = r_i + \gamma\,\hat{Q}^\pi_\phi(s_i', a_i^\pi), \quad \text{where } a_i^\pi \sim \pi_\theta(a_i'\mid s_i') \text{ is sampled from the current policy}$$
Same for (2), sample an action from current policy aiπ∼πθ(ai∣si) rather than using the original data
It’s fine to have high variance (not being baselined), because it’s easier and now we don't need to generate more states ⇒ we can just generate more actions
In exchange use a larger batch size ⇒ all good!
🔥
Still one problem left ⇒ si did not come from pθ(s) ⇒ but nothing we can do
Not too bad ⇒ Initially we want optimal policy on pθ(s), but we actually get optimal policy on a broader distribution
Now our final result:
Example Practical Algorithm:
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. 2018.
Not only does the policy gradient remain unbiased when you subtract any constant b; it also remains unbiased when you subtract any function that depends only on the state $s_i$ (and not on the action)
Control variates - methods that use state and action dependent baselines
The n-step advantage estimator is good, but can we generalize(hybrid) it?
Instead of hard cutting between two estimators, why don’t we combine them everywhere?
$$\hat{A}^\pi_{GAE}(s_t,a_t) = \sum_{n=1}^\infty w_n\,\hat{A}^\pi_n(s_t,a_t)$$
Where A^nπ stands for the n-step estimator
“Most prefer cutting earlier (less variance)”
So we set $w_n \propto \lambda^{n-1}$ ⇒ exponential falloff
Then
$$\hat{A}^\pi_{GAE}(s_t,a_t) = \sum_{t'=t}^\infty (\gamma\lambda)^{t'-t}\,\delta_{t'}, \quad \text{where } \delta_{t'} = r(s_{t'},a_{t'}) + \gamma\hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'})$$
🔥
Now we can trade off bias and variance using the parameter λ
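The backward recursion implied by the GAE formula above can be sketched as follows (toy numbers; `values` is assumed to carry a bootstrap value for the final state):

```python
import numpy as np

# Generalized advantage estimation:
#   A_GAE(s_t) = sum_{t'>=t} (gamma*lam)^{t'-t} * delta_{t'},
#   delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
def gae(rewards, values, gamma=0.99, lam=0.95):
    # values has length T+1: V(s_1..s_T) plus a bootstrap V(s_{T+1}).
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):               # backward recursion
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

adv = gae(np.array([1.0, 1.0, 1.0]), np.array([0.5, 0.5, 0.5, 0.0]))
```

With `lam=0` this reduces to the one-step TD errors (low variance, high bias); with `lam=1` it sums all the discounted deltas (Monte Carlo-like, high variance, low bias).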
Examples of actor-critic algorithms
Schulman, Moritz, Levine, Jordan, Abbeel ‘16. High dimensional continuous control with generalized advantage estimation
Batch-mode actor-critic
Blends Monte Carlo and GAE
Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu ‘16. Asynchronous methods for deep reinforcement learning
Online actor-critic, parallelized batch
CNN end-to-end
N-step returns with N=4
Single network for actor and critic
Suggested Readings
Classic: Sutton, McAllester, Singh, Mansour (1999). Policy gradient methods for reinforcement learning with function approximation: actor-critic algorithms with value function approximation
Talks about contents in this class
Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu (2016). Asynchronous methods for deep reinforcement learning: A3C -- parallel online actor-critic
Schulman, Moritz, Levine, Jordan, Abbeel (2016). High-dimensional continuous control using generalized advantage estimation: batch-mode actor-critic with blended Monte Carlo and function approximator returns
Gu, Lillicrap, Ghahramani, Turner, Levine (2017). Q-Prop: sample-efficient policy-gradient with an off-policy critic: policy gradient with Q-function control variate
We know that π′ is as good as π, and probably better.
Same as before,
Aπ(s,a)=r(s,a)+γE[Vπ(s′)]−Vπ(s)
Policy Iteration (Fitted Value Iteration)
So fitting value function can be done if we know the transition dynamics very easily using least square estimate… but what if we don’t know transition dynamics?
Policy Iteration without transition probability (Fitted Q-Iteration)
What if we change the policy evaluation step to
$$Q^\pi(s,a) \leftarrow r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(s'\mid s,a)}\big[Q^\pi(s', \pi(s'))\big]$$
Now ⇒ if we have an (s, a, r, s') tuple, then even if policy π′ changes its action in state s, we can still do policy evaluation on the tuple, because the only place the policy shows up is inside the expectation.
This simple change lets us run policy-iteration-style algorithms without actually knowing the transition dynamics
Instead of using $\mathbb{E}[V(s_i')]$, we use $\max_{a'} Q_\phi(s_i',a')$: the value function $V(s_i')$ is approximated by $\max_{a'} Q_\phi(s_i',a')$, and instead of taking an expectation over future states, we use the next state $s_i'$ from the sample
Works for off-policy samples
Only one network, no high-variance policy gradient
No convergence guarantees for non-linear function approximation (more on this later)
Special Cases of Fixed Q-Fitting
During exploration phase, using the greedy algorithm may not be the optimal choice (we want to maximize entropy!)
“epsilon chance of exploring other options other than optimal action”
Common practice is to vary epsilon during training (early on, when Q is bad, use a larger epsilon)
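A minimal epsilon-greedy sketch (the decay schedule is left out; `epsilon` would typically shrink over training):

```python
import random

# epsilon-greedy exploration: with probability epsilon take a random action,
# otherwise take the greedy (argmax-Q) action.
def epsilon_greedy(q_values, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# with epsilon = 0 this is purely greedy
assert epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0) == 1
```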
“Exponential”
π(at∣st)∝exp(Qϕ(st,at))
The best action will be most frequent, while leaving some probability for exploration. Also, if there is a close second-largest Q-value, that action is considered more often than clearly worse options.
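The Boltzmann (“exponential”) rule above, sketched with a numerically stable softmax over made-up Q-values:

```python
import numpy as np

# Boltzmann exploration: pi(a|s) proportional to exp(Q(s, a)).
def boltzmann_probs(q_values):
    z = np.exp(q_values - np.max(q_values))   # shift for numerical stability
    return z / z.sum()

p = boltzmann_probs(np.array([1.0, 2.0, -1.0]))
# best action is most likely, second-best gets more mass than the worst
assert p.argmax() == 1 and p[0] > p[2]
```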
Value Function Learning Theory
For tabular value iteration algorithm:
The Bellman operator captures the logic of updating V(s)
And one fact is that:
V∗, the value function for the optimal policy, is a fixed point of B
V∗ always exists, is always unique, and always corresponds to the optimal policy
Does fixed point iteration converge to V∗
We can prove that the value iteration reaches V∗ because B is a contraction
Contraction means: for any V and Vˉ, we have ∣∣BV−BVˉ∣∣∞≤γ∣∣V−Vˉ∣∣∞ where γ∈(0,1)
Meaning that V and Vˉ will get closer and closer as we apply B to them
If we replace Vˉ by V∗ (and since BV∗=V∗),
∣∣BV−V∗∣∣∞≤γ∣∣V−V∗∣∣∞
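The contraction bound can be watched numerically on a made-up 2-state, 2-action MDP: successive Bellman backups shrink the max-norm gap by at least a factor of $\gamma$ per step:

```python
import numpy as np

# Tabular value iteration on a toy 2-state, 2-action MDP. Repeated Bellman
# backups converge geometrically, as the contraction bound promises.
gamma = 0.9
r = np.array([[0.0, 1.0],        # r[s, a]
              [1.0, 0.0]])
P = np.zeros((2, 2, 2))          # P[s, a, s'] = transition probability
P[0, 0] = [1.0, 0.0]; P[0, 1] = [0.0, 1.0]
P[1, 0] = [1.0, 0.0]; P[1, 1] = [0.0, 1.0]

def bellman(V):
    # (BV)(s) = max_a [ r(s, a) + gamma * E_{s'}[V(s')] ]
    return np.max(r + gamma * (P @ V), axis=1)

V = np.zeros(2)
gaps = []
for _ in range(50):
    V_new = bellman(V)
    gaps.append(np.max(np.abs(V_new - V)))   # max-norm gap per iteration
    V = V_new

# successive gaps shrink by at least gamma, so the iteration converges
assert all(g2 <= gamma * g1 + 1e-12 for g1, g2 in zip(gaps, gaps[1:]))
```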
For fitted value iteration algorithm:
Step 1 basically applies the bellman operator
We will define a new operator for step 2, Π (pi)
Step 2 performs the following:
“Find an value function in the value function model hypothesis space Ω that optimizes the objective”
$$V' \leftarrow \arg\min_{V'\in\Omega} \frac{1}{2}\sum_s \left\|V'(s) - (BV)(s)\right\|^2$$
So step 2 performs this model fitting operator Π that projects BV onto hypothesis space Ω, our iteration algorithm can now be described as
V←ΠBV
B is still a contraction with respect to the ∞-norm:
||BV − BV̄||_∞ ≤ γ||V − V̄||_∞
Π is a contraction w.r.t. the L2-norm:
||ΠV − ΠV̄||_2 ≤ ||V − V̄||_2
However, ΠB is not a contraction of any kind
So fitted value iteration does not converge in general and often does not converge in practice
Same with fitted Q-Iteration, Batch actor-critic
🧙🏽♂️
Bad theoretical properties, but works in practice lol
But why? Isn't value function fitting just gradient descent?
We can turn this into a true gradient descent algorithm by also differentiating through the target values ⇒ the resulting method is called the residual algorithm; it has poor numerical properties and doesn't work well in practice
Correlation Problem in Q-Learning
⚠️
Sequential States are Strongly Correlated
And the target value is always changing (chasing its own tail)
Think about a sequence
In the early 3 steps, the target value is kind of “overfit” to those 3 values
In the supervised setting of learning Q / Value functions, the data points seen by the target values are only local states
We can solve this by adding more batches at one time (by synchronized parallel Q-learning or asynchronous parallel Q-learning)
Or we use replay buffers (fitted Q-learning is basically an off-policy method) ⇒ samples are no longer correlated, and each batch contains multiple samples (low-variance gradient)
But where does the data come from:
Need to periodically feed the replay buffer
⚠️
In the full fitted Q-iteration algorithm, we set
ϕ ← argmin_ϕ (1/2) Σ_i ||Q_ϕ(s_i, a_i) − y_i||²
But do we want it to actually converge to argmin if the sample has high variance?
Maybe just move one gradient step instead
ϕ ← ϕ − α Σ_i (dQ_ϕ/dϕ)(s_i, a_i) (Q_ϕ(s_i, a_i) − y_i)
Usually ⇒ K∈[1,4], N∈[10000,50000]
Target network makes sure that we’re not trying to hit a moving target ⇒ and now it looks more like supervised regression
🔥
Popular Alternative of target network update:
ϕ′ ← τϕ′ + (1−τ)ϕ, where τ = 0.999 works well
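The Polyak-style target update is one line per parameter; a minimal sketch with plain numpy arrays standing in for network parameters:

```python
import numpy as np

def polyak_update(target_params, online_params, tau=0.999):
    """phi' <- tau * phi' + (1 - tau) * phi, applied to each parameter array.
    With tau close to 1, the target network moves slowly toward the online network."""
    return [tau * tp + (1.0 - tau) * op
            for tp, op in zip(target_params, online_params)]
```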
Question: regarding the chasing-its-own-tail relationship, is there a typical scenario where the Q function chases its tail and shows bad behavior?
Classic Deep Q-Learning Algorithm (DQN)
A special case(with specific parameters) of Q-learning with replay buffer and target network
A general view of Q-Learning
Online Q-learning (last lecture): evict immediately, process 1, process 2, and process 3 all run at the same speed.
DQN: Process 1 and process 3 run at the same speed, process 2 is slow.
Fitted Q-iteration: process 3 in the inner loop of process 2, which is in the inner loop of process 1
Overestimation in Q-Learning and Double Q-Learning
Q functions are not accurate (predicted values are much, much larger than the true values)
🔥
Why does the Q-function systematically predict larger values than the true values?
Because of the target value:
y_j = r_j + γ max_{a_j′} Q_ϕ′(s_j′, a_j′)   ← this last (max) term is the problem
Imagine two r.v. X1,X2
We can prove:
E[max(X1,X2)]≥max(E[X1],E[X2])
But:
Qϕ′(s′,a′) is not perfect, it is “noisy” ⇒ We are always selecting the positive errors while training using the max
Note that
max_{a′} Q_ϕ′(s′, a′) = Q_ϕ′(s′, argmax_{a′} Q_ϕ′(s′, a′)), i.e. the action is selected according to Q_ϕ′
So use “double Q-Learning” ⇒ use two networks
Update ϕ_A using ϕ_B as the target network and ϕ_A as the action selector, and vice versa for the ϕ_B update.
Q_ϕA(s, a) ← r + γ Q_ϕB(s′, argmax_{a′} Q_ϕA(s′, a′))
Q_ϕB(s, a) ← r + γ Q_ϕA(s′, argmax_{a′} Q_ϕB(s′, a′))
Intuition: the action that looks "optimal" under Q_ϕA is chosen partly due to Q_ϕA's noise, but it is evaluated with Q_ϕB, whose noise is independent — so the max no longer systematically picks up positive errors, and the overestimation effect is mitigated
In practice, instead of having two Q functions, use the two Q functions that we already have.
We already have ϕ and ϕ′
In standard Q-learning:
y = r + γ Q_ϕ′(s′, argmax_{a′} Q_ϕ′(s′, a′))
Double Q-Learning:
y = r + γ Q_ϕ′(s′, argmax_{a′} Q_ϕ(s′, a′))
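The two targets differ only in which network selects the argmax; a small numpy sketch with hypothetical Q-value tables (rows = batch, columns = actions):

```python
import numpy as np

def dqn_target(r, q_next_target, gamma):
    """Standard target: y = r + gamma * max_a' Q_target(s', a')."""
    return r + gamma * q_next_target.max(axis=1)

def double_dqn_target(r, q_next_online, q_next_target, gamma):
    """Double-DQN target: the online net selects a', the target net evaluates it."""
    a_star = q_next_online.argmax(axis=1)
    return r + gamma * q_next_target[np.arange(len(r)), a_star]
```

Because the double target evaluates the target network at a (possibly suboptimal) action instead of taking its max, it can never exceed the standard target on the same batch.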
Multi-step returns
Q-learning target
y_{j,t} = r_{j,t} + γ max_{a_{j,t+1}} Q_ϕ′(s_{j,t+1}, a_{j,t+1})
Early in training, the term γ max_{a_{j,t+1}} Q_ϕ′(s_{j,t+1}, a_{j,t+1}) gives us mostly random noise because the network is bad, so r_{j,t} is the most important part. Later in training, as Q_ϕ′ gets better, this term dominates (it summarizes many future reward terms rather than a single one) and becomes more important.
So we can use something like the "Monte Carlo" sum of rewards from actor-critic algorithms
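An N-step target follows directly from the formula y_t = Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n max_{a′} Q(s_{t+n}, a′); a minimal sketch (the rewards and bootstrap value below are made up):

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """Discounted sum of the first n rewards plus a discounted bootstrap term,
    where bootstrap_value stands in for max_a' Q(s_{t+n}, a')."""
    n = len(rewards)
    return sum(gamma ** k * rewards[k] for k in range(n)) + gamma ** n * bootstrap_value
```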
Idea is to train another network μθ(s) such that μθ(s)≈argmaxaQϕ(s,a)
dQ_ϕ/dθ = (da/dθ)(dQ_ϕ/da)
classic: NFQCA
recent: TD3 and SAC
Stochastic Optimization
Simple Solution
max_a Q(s, a) ≈ max{Q(s, a₁), …, Q(s, a_N)}
Where (a1,…,aN) are sampled from some distribution (e.g. uniform)
Dead simple
Efficiently parallelizable
But - not very accurate
More accurate (works OK for ≥ 40 dims):
Cross-entropy method (CEM)
simple iterative stochastic optimization
sample actions from a distribution as in the simple method, but then refit the distribution to the best samples and keep sampling from a smaller and smaller range
CMA-ES
substantially less simple iterative stochastic optimization
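Both random shooting and CEM can be sketched in a few lines; the quadratic `q_vals` below is a hypothetical stand-in for Q(s,·) on a 1-D action space:

```python
import numpy as np

def q_vals(actions):
    """Hypothetical Q(s, .) for one fixed state, peaked at a = 0.3."""
    return -(actions - 0.3) ** 2

def random_shooting(q_fn, n=100, seed=0):
    """Dead simple: evaluate n uniform samples, keep the best."""
    rng = np.random.default_rng(seed)
    actions = rng.uniform(-1.0, 1.0, size=n)
    return actions[np.argmax(q_fn(actions))]

def cem(q_fn, iters=10, n=100, n_elite=10, seed=0):
    """Cross-entropy method: refit a Gaussian to the elite samples each iteration,
    so the sampling range shrinks around the best actions."""
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0
    for _ in range(iters):
        actions = rng.normal(mu, sigma, size=n)
        elite = actions[np.argsort(q_fn(actions))[-n_elite:]]
        mu, sigma = elite.mean(), elite.std() + 1e-6
    return mu
```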
Implementation Tips
Instability of Q-Learning
Q-learning takes some care to stabilize
Test on easy, reliable tasks first, make sure the implementation is correct
Large replay buffers help improve stability
Looks more like fitted Q-iteration
Takes time, be patient - might be no better than random for a while
Start with high exploration (epsilon) and gradually reduce
Bellman error gradients can be big; clip gradients or use Huber loss
Double Q-learning helps a lot in practice
Simple & no downsides!
N-step returns also help a lot
but have some downsides (will systematically bias the objective)
Schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too
Run multiple random seeds, it's very inconsistent between runs
Suggested Readings
Classic papers
Watkins. (1989). Learning from delayed rewards: introduces Q-learning
Riedmiller. (2005). Neural fitted Q-iteration: batch-mode Q-learning with neural networks
Deep reinforcement learning Q-learning papers
Lange, Riedmiller. (2010). Deep auto-encoder neural networks in reinforcement learning: early image-based Q-learning method using autoencoders to construct embeddings
Mnih et al. (2013). Human-level control through deep reinforcement learning: Q-learning with convolutional networks for playing Atari.
Van Hasselt, Guez, Silver. (2015). Deep reinforcement learning with double Q-learning: a very effective trick to improve performance of deep Q-learning.
Lillicrap et al. (2016). Continuous control with deep reinforcement learning: continuous Q-learning with actor network for approximate maximization.
Gu, Lillicrap, Sutskever, L. (2016). Continuous deep Q-learning with model-based acceleration: continuous Q-learning with action-quadratic value functions.
Wang, Schaul, Hessel, van Hasselt, Lanctot, de Freitas (2016). Dueling network architectures for deep reinforcement learning: separates value and advantage estimation in Q-function.
Model-based Methods
🔥
Why model-based methods? Because if we know the transition dynamics f(s_t, a_t) (or p(s_{t+1}|s_t, a_t)) then we can use trajectory planning ⇒ so we can learn f(s_t, a_t) from data
📌
Model-based RL can be prone to errors from distribution mismatch ⇒ much like behavior cloning: when the planned trajectory queries out-of-distribution states, f(⋅) is likely to be wrong
MPC (Model Predictive Control)
Performance Gap Issue
We want good exploration to gather more data to address the distribution shift, but with a not-so-good early model we cannot deploy good exploration techniques
Basically, early on the model is an overfit function approximator, and the exploration policy gets stuck in a local minimum / maximum
Uncertainty Estimation
Uncertainty estimation helps us reduce performance gap.
Under this trick, the expectation of reward under high-variance prediction is very low, even though the mean is the same
However:
We still need exploration to get better model
Expected value is not the same as pessimistic value
But how to build it?
Use NN?
Cannot express uncertainty of unseen data accurately ⇒ when querying out-of-sample data, the uncertainty output is arbitrary
“The model is certain about data, but we are not certain about the model”
Estimate Model Uncertainty
Estimate argmax_θ log p(θ|D) = argmax_θ log p(D|θ)
Can we instead estimate the distribution P(θ∣D)?
We can use this to get a posterior distribution over the next state
∫p(st+1∣st,at,θ)p(θ∣D)dθ
Bayesian Neural Nets
Every weight, instead of being a number, is a distribution
But it’s difficult to join parameters into a single distribution
So we do approximation
p(θ∣D)=∏ip(θi∣D)
Very Crude approximation
p(θi∣D)=N(μi,σi2)
Instead of learning numerical value learn mean value and std/variance
Blundell et al. Weight Uncertainty in Neural Networks
Gal et al. Concrete Dropout
Bootstrap Ensembles
Train several different neural nets on bootstrap samples (each D_i sampled with replacement from D)
So each network learns slightly differently
In theory each NN learns well on their own training data but makes different mistakes outside of training data
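A minimal illustration with bootstrapped polynomial regressors standing in for neural nets: the ensemble's disagreement (std of predictions) is small near the training data and large far from it, which is exactly the uncertainty signal we want.

```python
import numpy as np

def fit_bootstrap_ensemble(X, y, n_models=10, degree=3, seed=0):
    """Fit each cubic-polynomial model on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        models.append(np.polyfit(X[idx], y[idx], degree))
    return models

def ensemble_predict(models, x):
    """Mean prediction and ensemble disagreement (std) at query points x."""
    preds = np.array([np.polyval(m, x) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)

# Training data lives on [-1, 1]; x = 3.0 is far outside it
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=50)
y = np.sin(3 * X) + 0.1 * rng.normal(size=50)
models = fit_bootstrap_ensemble(X, y)
_, std_in = ensemble_predict(models, np.array([0.0]))   # in-distribution
_, std_out = ensemble_predict(models, np.array([3.0]))  # out-of-distribution
```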
Policy gradient might be more stable (if enough samples are used) because it does not require multiplying many Jacobians
Parmas et al. ’18: PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos
If we use policy gradient to backprop through our learned dynamics model, there would still be a problem (although generating samples does not require interacting with the physical world or a simulator)
Left: Open-loop planning with long rollouts; Middle: Open-loop planning with shorter rollouts; Right: Collect real-world data from current policy and do short rollouts planning with data sampled from real distribution
But this does not fully solve the problem: the point of model-based RL is sample efficiency, and using sampled real-world data for short-rollout open-loop planning restricts how much we can change our policy.
Model-based RL with Policies
📌
Question is how to use the model’s predictions? fit Q functions or Value functions or what?
Model-Based Acceleration (MBA)
Gu et al. Continuous deep Q-learning with model-based acceleration. ‘16
Model-Based Value Expansion (MVE)
Feinberg et al. Model-based value expansion. ’18
Model-Based Policy Optimization (MBPO)
Janner et al. When to trust your model: model-based policy optimization. ‘19
Good Idea:
Sample Efficient
Bad Idea:
Additional Source of bias if model is not correct
Reduce this by using ensemble methods
If states were never seen before, we still cannot choose valid actions
Multi-step Models and Successor Representations
📌
The job of the model is to evaluate the policy
V^π(s_t) = Σ_{t′=t}^∞ γ^{t′−t} E_{p(s_{t′}|s_t)} E_{a_{t′}|s_{t′}} [r(s_{t′}, a_{t′})]
= Σ_{t′=t}^∞ γ^{t′−t} E_{p(s_{t′}|s_t)} [r(s_{t′})]
= Σ_{t′=t}^∞ γ^{t′−t} Σ_s p(s_{t′} = s | s_t) r(s)
= Σ_s (Σ_{t′=t}^∞ γ^{t′−t} p(s_{t′} = s | s_t)) r(s)
Adding a (1−γ) factor in front turns the inner sum into the successor distribution p^π(s_future = s | s_t)
Update pπ(F=1∣st,at,s) using SGD with cross entropy loss
Eysenbach, Salakhutdinov, Levine. C-Learning: Learning to Achieve Goals via Recursive Classification. 2020.
Exploration (Lec 13)
It's hard for algorithms to know what the rules of the environment are — like the card game "Mao"
The only rule you may be told is this one
Incur a penalty when you break a rule
Can only discover rules through trial and error
Rules don’t always make sense to you
So here is the definition of exploration problem:
How can an agent discover high-reward strategies that require a temporally extended sequence of complex behaviors that, individually, are not rewarding?
How can an agent decide whether to attempt new behaviors or continue doing the best thing it knows so far?
So can we derive an optimal exploration strategy:
Tractable (easy to understand): if it is easy to know whether the policy is optimal or not
Multi-arm bandits (1-step, stateless RL) / Contextual bandits (1 step, but there’s a state)
Can formalize exploration as POMDP (partially observed MDP; although there is only one time step, performing actions helps us expand our knowledge) identification
policy learning is trivial even with POMDP
small, finite MDPs
Can frame as Bayesian model identification, reason explicitly about value of information
Large or infinite MDPs
Optimal methods don’t work
Exploring with random actions: oscillate back and forth, might not go to a coherent or interesting place
But can take inspiration from optimal methods in smaller settings
use hacks
Bandits
The drosophila (fruit fly — a model organism) of exploration problems
A "one-armed bandit" is a slot machine. Multi-armed bandit ⇒ have to decide which machine to play; pulling an arm and losing doesn't necessarily mean it was a bad decision
So the definition of the bandit:
Assume r(ai)∼pθi(ri)
e.g. p(ri=1)=θi and p(ri=0)=1−θi
Where (but you don’t know):
θi∼p(θ)
This defines a POMDP with s=[θ1,…,θn]
Belief state is p^(θ1,…,θn)
Solving the POMDP yields the optimal exploration strategy
But this is overkill — the belief state is huge! ⇒ we can do very well with much simpler strategies
We will define regret as the difference from optimal policy at time step T
Reg(T) = T · E[r(a*)] − Σ_{t=1}^T r(a_t)
How do we beat the bandit?
Variety of relatively simple strategies
Often can provide theoretical guarantees on regret
Empirical performance may vary
Exploration strategies for more complex MDP domains will be inspired by these strategies
Optimistic exploration: Reg(T) ∈ O(log T)
Keep track of average reward μ^a for each action a
Exploitation: pick a = argmax_a μ̂_a
Optimistic estimate: a = argmax_a μ̂_a + Cσ_a
“Try each arm until you are sure it’s not great”
a = argmax_a μ̂_a + √(2 ln T / N(a)), where N(a) is the number of times we have picked this action
To put this in practice, use r+(s,a)=r(s,a)+B(N(s)) ⇒ easy to add in but hard to tune the weights of B(N(s)).
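The UCB rule above can be simulated in a few lines on a toy Gaussian bandit (the arm means below are made up):

```python
import numpy as np

def ucb_bandit(true_means, T, seed=0):
    """UCB: pull each arm once, then pick a = argmax mu_hat(a) + sqrt(2 ln t / N(a)).
    Rewards are Gaussian with unit noise around the (unknown) true means."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for t in range(1, T + 1):
        if t <= n_arms:
            a = t - 1                                  # initialize: pull each arm once
        else:
            mu_hat = sums / counts
            a = int(np.argmax(mu_hat + np.sqrt(2 * np.log(t) / counts)))
        sums[a] += rng.normal(true_means[a], 1.0)
        counts[a] += 1
    return counts

counts = ucb_bandit([0.1, 0.9, 0.5], T=2000)   # arm 1 is the best arm
```

The optimism bonus shrinks as N(a) grows, so suboptimal arms are tried only O(log T) times while the best arm absorbs almost all pulls.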
Problem is in continuous states ⇒ we never see same thing twice ⇒ but some states are more similar than others
Bellemare et al. "Unifying Count-Based Exploration …"
So we will
Fit a density model pθ(s) or pθ(s,a)
pθ(s) might be high even for a new s if s is similar to previous seen states
But how can the distribution of pθ match the distribution of a count?
Notice that the true probability is P(s) = N(s)/n, and after seeing s, P′(s) = (N(s)+1)/(n+1)
So we need to output densities, but doesn’t necessarily need to produce great samples
Opposite considerations from many popular generative models like GANs
Bellemare et al. “CTS” Model: Condition each pixel on its top-left neighborhood
Other models: Stochastic neural networks, compression length, EX2
Probability Matching / Posterior Sampling (Thompson Sampling)
Assume r(ai)∼pθi(ri) ⇒ which defines a POMDP with s=[θ1,…,θn]
Belief state is p^(θ1,…,θn)
Idea:
Sample θ1,…,θn∼p^(θ1,…,θn)
Pretend the model θ1,…,θn is correct
Take the optimal action
Update the model
Chapelle & Li, “An Empirical Evaluation of Thompson Sampling.”
⇒ Hard to analyze theoretically but can work very well empirically
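A minimal Beta-Bernoulli version of the four steps above (the arm probabilities are made up): the Beta posterior is conjugate to Bernoulli rewards, so the update is just two counters per arm.

```python
import numpy as np

def thompson_bernoulli(true_probs, T, seed=0):
    """Beta-Bernoulli Thompson sampling: sample a model from the posterior,
    pretend it is correct, act optimally, then update the posterior."""
    rng = np.random.default_rng(seed)
    n = len(true_probs)
    alpha = np.ones(n)            # Beta(1, 1) uniform priors
    beta = np.ones(n)
    counts = np.zeros(n, dtype=int)
    for _ in range(T):
        theta = rng.beta(alpha, beta)        # step 1: sample theta ~ p_hat
        a = int(np.argmax(theta))            # steps 2-3: act as if theta is correct
        reward = rng.random() < true_probs[a]
        alpha[a] += reward                   # step 4: posterior update
        beta[a] += 1 - reward
        counts[a] += 1
    return counts

counts = thompson_bernoulli([0.2, 0.8, 0.5], T=2000)   # arm 1 is the best arm
```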
Osband et al. "Deep Exploration via Bootstrapped DQN"
What do we sample?
how do we represent the distribution?
Bootstrap ensembles:
Given a dataset D, resample with replacement N times to get D1,…,DN
train each model fθi on Di
To sample from p(θ), sample i∈[1,…,N] and use fθi
❓
Train N big neural nets is expensive, can we avoid it?
In the paper they train one net with multiple heads ⇒ this may actually be undesirable since the outputs are now correlated, but they may be uncorrelated enough for us to use
🧙🏽♂️
This might work better than epsilon-greedy because instead of oscillating back and forth (and not going anywhere interesting at all), we commit to a randomized but internally consistent strategy for an entire episode
No change to original reward function
Very good bonuses often do better than this
Methods that use Information Gain
Bayesian Experimental Design
If we want to determine some latent variable z ⇒ z might be optimal action or its value
Which action do we take?
Let H(p^(z)) be the current entropy of our z estimate
Let H(p^(z)∣y) be the entropy of our z estimate after observation y (y might be r(a) in the case of RL)
The lower the entropy, the more precisely we know z
IG(z, y) = E_y[H(p̂(z)) − H(p̂(z)|y)]
Typically this depends on the action, so we will use the notation IG(z, y | a) = E_y[H(p̂(z)) − H(p̂(z)|y) | a]
🧙🏽♂️
One important thing is to decide what y is in our problem and what z is
Russo & Van Roy “Learning to Optimize via Information-Directed Sampling”
y=ra,z=θa
Observe reward and learn parameters of p^(ra)
g(a)=IG(θa,ra∣a)→information gain of a
Δ(a)=E[r(a∗)−r(a)]→expected suboptimality of a
We want to gain more information but we don’t want our policy to be suboptimal
So we will choose a according to
argmin_a Δ(a)² / g(a)
Houthooft et al. “VIME”
Question:
Information gain about what?
Reward r(s,a)
Not useful if reward is sparse
State density p(s)
A bit strange, but somewhat makes sense!
Dynamics p(s’∣s,a)
Good proxy for learning the MDP, though still heuristic
Generally they are intractable to use exactly, regardless of what is being estimated
A few approximations:
Prediction Gain (Schmidhuber ‘91, Bellemare ‘16)
logpθ’(s)−logpθ(s)
If density changed a lot, then the state is novel
Variational Inference (Houthooft et al. “VIME”)
IG can be equivalently written as D_KL(p(z|y) ∥ p(z))
Learn about transitions pθ(st+1∣st,at):z=θ
y=(st,at,st+1)
Intuition: a transition is more informative if it causes belief over θ to change
Idea:
Use variational inference to estimate q(θ∣ϕ)≈p(θ∣h)
Represent q(θ∣ϕ) as product of independent Gaussian parameter distributions with mean ϕ
p(θ|D) ≈ Π_i p(θ_i|D)
p(θ_i|D) = N(μ_i, σ_i²), with means given by ϕ
See Blundell et al. "Weight Uncertainty in Neural Networks"
Given new transition (s,a,s’), update ϕ to get ϕ′
use DKL(q(θ∣ϕ’),q(θ∣ϕ)) as approximate bonus
Appealing mathematical formalism
Models are more complex, generally harder to use effectively
Counting with Hashes
What if we still count states, but compress them into hashes and count the hashes?
Compress s into k-bit code via ϕ(s) then count N(ϕ(s))
shorter codes = more hash collisions
similar states get the same hash? Maybe
Depends on model we choose
We can use an autoencoder to learn the compression, which improves the odds that similar states collide
Tang et al. “#Exploration: A Study of Count-Based Exploration”
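A minimal SimHash-style counter in the spirit of the above (random sign projections; the 8-bit code length and 4-dim states are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))        # k = 8 random projections for 4-dim states

def simhash(state):
    """k-bit code phi(s): the sign pattern of the random projections.
    Nearby states tend to share a code (a hash collision, here a feature)."""
    return tuple((A @ state > 0).astype(int))

counts = {}

def count_bonus(state):
    """Increment N(phi(s)) and return an exploration bonus ~ 1/sqrt(N(phi(s)))."""
    code = simhash(state)
    counts[code] = counts.get(code, 0) + 1
    return 1.0 / np.sqrt(counts[code])

s = np.array([1.0, -0.5, 0.2, 0.0])
b1 = count_bonus(s)   # first visit to this code: bonus 1
b2 = count_bonus(s)   # second visit: bonus 1/sqrt(2)
```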
Implicit Density Modeling with Exemplar Models
Fu et al. “EX2: Exploration with Exemplar Models…”
We need p_θ(s) to be able to output densities, but it doesn't necessarily need to produce great samples ⇒ can we explicitly compare the new state to past states?
🧙🏽♂️
Intuition: the state is novel if it is easy for a classifier to distinguish it from all previously seen states
For each observed state s, fit a classifier to classify that state against all past states D and use classifier error to obtain density
p_θ(s) = (1 − D_s(s)) / D_s(s)
where Ds(s) is the probability that the classifier assigns s as a “positive”
In reality, every state is unique so we regularize the classifier
Heuristic estimation of counts via errors
We just need to tell if state is novel or not
Given buffer D={(si,ai)}, fit f^θ(s,a)
So we basically use the error of f^θ(s,a) and actual f∗(s,a) as a bonus
ϵ(s,a)=∣∣f^θ(s,a)−f∗(s,a)∣∣2
But what function should we use for f∗?
f∗(s,a)=s’
f∗(s,a) as a neural network, but with parameters chosen randomly
We mentioned how D_KL(q(θ|ϕ′) ∥ q(θ|ϕ)) can be seen as information gain; it can also be viewed as the change in network parameters ϕ
So if we forget about IG, there are many other ways to measure this
Stadie et al. 2015:
encode image observations using auto-encoder
build predictive model on auto-encoder latent states
use model error as exploration bonus
Schmidhuber et al. (see, e.g., "Formal Theory of Creativity, Fun, and Intrinsic Motivation"):
exploration bonus for model error
exploration bonus for model gradient
many other variations
Suggested Readings
Schmidhuber. (1992). A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers.
Stadie, Levine, Abbeel (2015). Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models.
Osband, Blundell, Pritzel, Van Roy. (2016). Deep Exploration via Bootstrapped DQN.
Houthooft, Chen, Duan, Schulman, De Turck, Abbeel. (2016). VIME: Variational Information Maximizing Exploration.
Tang, Houthooft, Foote, Stooke, Chen, Duan, Schulman, De Turck, Abbeel. (2016). #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning.
Fu, Co-Reyes, Levine. (2017). EX2: Exploration with Exemplar Models for Deep Reinforcement Learning.
Unsupervised learning of diverse behaviors
What if we want to recover diverse behavior without any reward function at all?
Learn skills without supervision
But then use them to accomplish goals
Learn sub-skills to use with hierarchical reinforcement learning
Explore the space of possible behaviors
One possible case is to describe a target goal and the machine would just try to reach the goal
xˉ may not be equal to xg at all…but we will do some updates regardless.
pθ(x∣z) ⇒ Latent Model (VAE) that transforms a proposed state z to x
However: this means the model tends to generate states that we've already seen (since we sample from p(z)) ⇒ we are not exploring very well
❓
How do we reach diverse goals?
We will modify step 4 ⇒ instead of doing MLE: θ, ϕ ← argmax_{θ,ϕ} E[log p(x̄)], we will do a weighted MLE:
θ, ϕ ← argmax_{θ,ϕ} E[w(x̄) log p(x̄)]
We want to assign greater weight to rarely seen states, and since we've used a generator for p_θ(x|z), we can use
w(x̄) = p_θ(x̄)^α, α ∈ [−1, 0)
⇒ Note that for any α∈[−1,0), the entropy H(pθ(x)) increases ⇒ proposing broader and broader goals
The RL formulation:
maxH(p(G))
π(a∣S,G) trained to reach goal G
as π gets better, the final state S gets close to G ⇒ p(G∣S) becomes more deterministic
To maximize exploration,
maxH(p(G))−H(p(G∣S))=maxI(S;G)
This is mainly for gathering data that leads to uniform density over the state distribution ⇒ doesn’t mean the policy is going randomly, what if we want a policy that goes randomly?
❓
But is state-entropy really a good objective?
Eysenbach's theorem: at test time, an adversary will choose the worst goal G ⇒ then the best goal distribution to use for training is p(G) = argmax_p H(p(G))
Hazan, Kakade, Singh, Van Soest. Provably Efficient Maximum Entropy Exploration
State Marginal Matching (SMM)
Learn π(a∣s) to minimize DKL(pπ(s),p∗(s))
If we want to use intrinsic motivation, we have:
r~(s)=logp∗(s)−logpπ(s)
Note that this reward, summed over time, matches exactly the state-marginal objective above ⇒ however, RL is not aware that −log p_π(s) depends on π ⇒ tail-chasing problem
We can prove that this π*(a|s) is a Nash equilibrium of a two-player game
Learning Diverse Skills
π(a | s, z), where z is a task index
Reaching diverse goals is not the same as performing diverse tasks
not all behaviors can be captured by goal-reaching
π = argmax_π Σ_z E_{s∼π(s|z)}[r(s, z)]
We want states reached for other tasks z′ ≠ z to be unlikely; one way:
Use a classifier that gives a softmax output over the possible z's
r(s,z)=logp(z∣s)
Connection to mutual information:
Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.
Gregor et al. Variational Intrinsic Control. 2016
I(z, s) = H(z) − H(z|s), where H(z) is maximized by using a uniform prior p(z), and H(z|s) is minimized by maximizing log p(z|s)
Offline RL (Batch RL / fully off-policy RL)
Huge gap in DL and DRL ⇒ supervised learning has tons of data!
Formally:
D = {(s_i, a_i, s_i′, r_i)}, with s ∼ d^{π_β}(s), a ∼ π_β(a|s), r ← r(s, a)
πβ is the unknown policy that collected the data
Off-Policy Evaluation (OPE)
Given D, estimate the return J(π) of some policy π
Offline Reinforcement Learning
Given D, learn the best possible policy π_θ (such that the policy's good behavior is supported by the dataset)
How is this even possible?
Find the “good stuff” in a dataset full of good and bad behaviors
Generalization: Good behavior in one place may suggest good behavior in another place
"Stitching": parts of good behaviors can be recombined (even badly-performing trajectories can contain good parts)
Bad intuition: It’s like imitation learning
Though it can be shown to be provably better than imitation learning even with optimal data, under some structural assumptions ⇒ Kumar, Hong, Singh, Levine. Should I Run Offline Reinforcement Learning or Behavioral Cloning?
Better Intuition: Get order from chaos
e.g. Get object from a drawer ⇒ Singh, Yu, Yang, Zhang, Kumar, Levine. COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning.
Behavior cloning vs. "stitching" (figure)
What can go wrong if we just use off-policy RL (on data collected by other agents)?
∇_θ J(θ) ≈ (1/N) Σ_{i=1}^N Σ_{t=0}^T (Π_{t′=0}^{t−1} π_θ(a_{t′,i}|s_{t′,i}) / π_β(a_{t′,i}|s_{t′,i})) γ^t ∇_θ log π_θ(a_{t,i}|s_{t,i}) (Π_{t′=t}^{T} π_θ(a_{t′,i}|s_{t′,i}) / π_β(a_{t′,i}|s_{t′,i})) Q̂(s_{t,i}, a_{t,i})
where the first product accounts for the difference in probability of landing in s_{t,i}, and the second product accounts for the otherwise-incorrect Q̂(s_{t,i}, a_{t,i})
Classic advanced policy gradient methods completely strip away the first product term ⇒ assuming that the new policy π_θ is not too far from π_β
Probability of seeing (s’,a’) is the probability of starting with s’ and executing a’ or do transitions and arrive at the state s’ and executing a’
Solving for w(s,a) typically involves some fixed point problem
Suggested Reading
Classic work on importance sampled policy gradients and return estimation:
Precup, D. (2000). Eligibility traces for off-policy policy evaluation.
Peshkin, L. and Shelton, C. R. (2002). Learning from scarce experience.
Doubly robust estimators and other improved importance-sampling estimators:
Jiang, N. and Li, L. (2015). Doubly robust off-policy value evaluation for reinforcement learning.
Thomas, P. and Brunskill, E. (2016). Data-efficient off-policy policy evaluation for reinforcement learning.
Analysis and theory:
Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. (2015). High-confidence off-policy evaluation.
Marginalized importance sampling:
Hallak, A. and Mannor, S. (2017). Consistent on-line off-policy evaluation.
Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. (2019). Off-policy policy gradient with state distribution correction.
Imitating good actions “more”(determined by w(s,a)) than bad actions
Problem: In order to get advantage values / target values we still need to query OOD actions
If we choose λ correctly the constraints will be respected at convergence, but not necessarily at intermediate steps
🧙🏽♂️
Can we avoid ALL OOD actions in the Q update?
Instead of doing
Q(s,a)←r(s,a)+Ea′∼πnew[Q(s′,a′)]
We can do:
Q(s,a)←r(s,a)+V(s′)
But how do we fit this V’ function?
V ← argmin_V (1/N) Σ_{i=1}^N l(V(s_i), Q(s_i, a_i))
MSE loss: (V(s_i) − Q(s_i, a_i))²
Problem is ⇒ the actions come from the behavior policy π_β, not from π_new
Gives us Ea∼πβ[Q(s,a)]
Think about this:
Distribution visualization of states
We’ve probably seen only a bit or none of action in the “exact” same state
But there might be very "similar" states where we executed actions ⇒ there are not bins of states but distributions of states
Can we use those “similar states” as extra data points for us to perform generalization on?
But what if we use a "quantile" / "expectile" ⇒ an expectile is to squared error what a quantile is to absolute error
Upper expectile / quantile is the best policy supported by the data
Implicit Q-Learning
Expectile:
l₂^τ(x) = (1−τ)x² if x > 0, τx² otherwise
Intuition:
MSE penalizes positive and negative errors equally
expectile penalizes positive and negative with different weights depending on τ∈[0,1]
Weights given to positive and negative errors in expectile regression. (Figure taken from the IQL paper.)
Formally, it could be shown that if we use
V ← argmin_V E_{(s,a)∼D}[l₂^τ(V(s) − Q(s,a))]
In practice, Q(s,a) would be a target network
Formally, we can prove that with a big enough τ,
V(s) → max_{a∈Ω(s)} Q(s,a)
Where
Ω(s) = {a : π_β(a|s) ≥ ε}
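The expectile machinery can be sketched in numpy: `expectile` below finds the scalar minimizer of the asymmetric loss by gradient descent (the step size and iteration count are arbitrary choices; the Q-value samples in the test are made up). As τ → 1 the result moves toward the max of the samples, which is why IQL can approximate the max over in-support actions without ever querying OOD actions.

```python
import numpy as np

def expectile_loss(x, tau):
    """l2_tau(x) = (1 - tau) x^2 if x > 0, tau x^2 otherwise (here x = V - Q)."""
    return np.where(x > 0, 1.0 - tau, tau) * x ** 2

def expectile(samples, tau, iters=500, lr=0.5):
    """Scalar v minimizing mean expectile_loss(v - samples, tau), via gradient descent.
    For tau > 0.5 negative errors (v below a sample) cost more, pushing v upward."""
    samples = np.asarray(samples, dtype=float)
    v = samples.mean()
    for _ in range(iters):
        x = v - samples
        v -= lr * np.mean(2 * np.where(x > 0, 1.0 - tau, tau) * x)
    return v
```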
Kostrikov, Nair, Levine. Offline Reinforcement Learning with Implicit Q-Learning. ‘21
📌
Oh yes, that’s my postdoc’s paper (as the first author)! This is absolutely an amazing paper and he is absolutely an amazing person, also this is one of the papers I read for entering into Sergey’s group. I hope you get to check out this paper!
Conservative Q-Learning (CQL)
Kumar, Zhou, Tucker, Levine. Conservative Q-Learning for Offline Reinforcement Learning
Directly repair overestimated actions in the Q function
Intuition: push down on areas that have erroneously high Q values
We can formally show that Q^π≤Qπ for large enough α
But with this objective, we’re pressing down on actually “good” actions as well ⇒ so add additional terms that pushes up on good samples supported by data
No longer guaranteed that Q̂^π(s,a) ≤ Q^π(s,a) for all (s,a)
But for all s ∈ D: E_{π(a|s)}[Q̂^π(s,a)] ≤ E_{π(a|s)}[Q^π(s,a)]
In practice, we add a regularizer R(μ) in LCQL(Q^π) so that the μ doesn’t need to be calculated directly.
Effective analysis is very hard in RL without strong assumptions
Trick is to make assumptions that admit interesting conclusions without divorcing us (too much) from reality
What is the point:
Prove that our RL algorithms will work perfectly every time
Usually not possible with current deep RL methods, which are often not even guaranteed to converge
Understand how errors are affected by problem parameters
Do larger discounts work better than smaller ones?
If we want half the error, do we need 2x the samples, 4x, or something else?
Usually we use precise theory to get imprecise qualitative conclusions about how various factors influence the performance of RL algorithms under strong assumptions, and try to make the assumptions reasonable enough that these conclusions are likely to apply to real problems (but they are not guaranteed to apply to real problems)
⚠️
Don’t take someone seriously if they say their RL algorithm has “provable guarantees”
the assumptions are always unrealistic, and theory is at best a rough guide to what might happen
Exploration
Performance of RL is greatly complicated by exploration - how likely are we to find potentially sparse rewards?
Theoretical guarantees typically address worst case performance ⇒ but worst case exploration is extremely hard
🧙🏽♂️
Goal: Show that exploration method is Poly(∣S∣,∣A∣,1/(1−γ))
If we “abstract away” exploration, how many samples do we need to effectively learn a model or value function that results in good performance?
“generative model” assumption: assume we can sample from P(s’∣s,a) for any (s,a)
“oracle exploration” assumption: For every (s,a), sample s’∼P(s’∣s,a) N times
Suppose X₁, …, X_n are a sequence of independent, identically distributed (i.i.d.) random variables with mean μ. Let X̄_n = (1/n) Σ_{i=1}^n X_i. Suppose that X_i ∈ [b₋, b₊] with probability 1. Then
P(X̄_n ≥ μ + ϵ) ≤ exp{−2nϵ² / (b₊ − b₋)²}
Therefore if we estimate μ with n samples, the probability we're off by more than ϵ is at most 2 exp{−2nϵ² / (b₊ − b₋)²} (since we can be off on either side). So if we want this probability to be δ:
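A quick Monte Carlo sanity check of the bound with Bernoulli(0.5) samples (so b₋ = 0, b₊ = 1 and μ = 0.5; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 100, 0.1, 5000

# For each trial, draw n coin flips and record the sample mean
xbars = rng.integers(0, 2, size=(trials, n)).mean(axis=1)

# Empirical frequency of a one-sided deviation of at least eps
freq = np.mean(xbars >= 0.5 + eps)

# Hoeffding's one-sided bound: exp(-2 n eps^2 / (b+ - b-)^2) = exp(-2) ~ 0.135
bound = np.exp(-2 * n * eps ** 2 / (1 - 0) ** 2)
```

The bound is loose here (the true deviation probability is closer to 0.03), which is typical: Hoeffding trades tightness for generality.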
Let z be a discrete random variable taking values in {1, 2, …, d}, distributed according to q. We can write q as a vector q = [P(z = j)]_{j=1}^d. Assume we have N i.i.d. samples, and that our empirical estimate of q is [q̂]_j = Σ_{i=1}^N 1{z_i = j} / N
We have that ∀ϵ>0,
P(||q̂ − q||₂ ≥ 1/√N + ϵ) ≤ exp{−Nϵ²}
Which implies:
P(||q̂ − q||₁ ≥ √d (1/√N + ϵ)) ≤ exp{−Nϵ²}
Let
δ = P(||q̂ − q||₁ ≥ √d (1/√N + ϵ))
Then
δ ≤ exp{−Nϵ²}
ϵ ≤ √((1/N) log(1/δ))
N ≤ (1/ϵ²) log(1/δ)
Using these concentration inequalities, we see that in the "Basic Sample Complexity Analysis", the state-transition estimation error ||P̂ − P||₁ can be bounded with high probability
So if we write out the transition function P, the Q function Q^π, the reward function r, and the value function V^π as matrices:
Dims of each matrix
Qπ=r+γPVπ
Remember that V function is just an expectation over the Q function (a sum) ⇒ so we can also write it as an expectation (where Π∈R∣S∣×(∣S∣∣A∣) and is a probability matrix)
Vπ=ΠQπ
With this, (and Pπ=PΠ)
Qπ=r+γPπQπ
So
(I − γP^π) Q^π = r
Q^π = (I − γP^π)^{−1} r
And it turns out the matrix (I − γP^π) is always invertible
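A minimal numpy check on a randomly generated MDP under a fixed policy: solve the linear system directly and verify that the result satisfies the Bellman equation (sizes and discount are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 6, 0.95                        # n = |S||A| state-action pairs
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)   # rows are probability distributions
r = rng.random(n)

# Q^pi = (I - gamma P^pi)^{-1} r ; the matrix is invertible since
# the spectral radius of gamma * P_pi is at most gamma < 1
Q = np.linalg.solve(np.eye(n) - gamma * P_pi, r)
```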
Simulation Lemma:
Q^π − Q̂^π = γ (I − γP̂^π)^{−1} (P − P̂) V^π
where (I − γP̂^π)^{−1} is the evaluation operator (it turns a reward function into a Q function) and (P − P̂) is the difference in model probabilities
Note: We will not be computing P^ and r^, it would just be the effect of averaging together transitions in the data (having different gradients for different samples)
We will assume the infinity norm here because there would be no convergence if using the L2 norm
||Q̂_k − Q*||_∞ = ||Q̂_k − T Q̂_{k−1} + T Q̂_{k−1} − Q*||_∞
= ||(Q̂_k − T Q̂_{k−1}) + (T Q̂_{k−1} − T Q*)||_∞   (using the fact that Q* is a fixed point of T)
≤ ||Q̂_k − T Q̂_{k−1}||_∞ + ||T Q̂_{k−1} − T Q*||_∞
≤ ϵ_{k−1} + γ ||Q̂_{k−1} − Q*||_∞   (using the fact that T is a γ-contraction)
≤ Σ_{i=0}^{k−1} γ^i ϵ_{k−i−1} + γ^k ||Q̂_0 − Q*||_∞   (unrolling the recursion)